Methods for monitoring the expression of alternatively spliced genes

ABSTRACT

Methods, probe arrays and computer software products are provided for determining the arrangement of sequence elements. In one embodiment, methods for making and using exon chips are provided. The exon chips may be used to identify and quantify splice variants.

This application claims the benefit of U.S. Provisional Application Nos. 60/362,315, 60/362,456, 60/362,524, 60/362,454, 60/362,455, 60/362,399, 60/433,135, 60/433,225, 60/422,220, 60/398,958 and 60/384,552. All the cited applications are incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

U.S. Pat. Nos. 5,424,186 and 5,445,934 describe a pioneering technique for, among other things, forming and using high density arrays of molecules such as oligonucleotides, RNA, peptides, polysaccharides, and other materials. The patents are hereby incorporated by reference for all purposes. Arrays of oligonucleotides or peptides, for example, are formed on the surface by sequentially removing a photoremovable group from a surface, coupling a monomer to the exposed region of the surface, and repeating the process. These techniques have been used to form extremely dense arrays of oligonucleotides, peptides, and other materials. Such arrays are useful in, for example, drug development, gene expression monitoring, genotyping, and a variety of other applications.

The development of the nucleic acid probe array technology provides means for studying the complex regulation of expression of a large number of genes. U.S. Pat. No. 6,040,138, for example, describes the process for monitoring the expression of a large number of genes. One important aspect of gene expression regulation is alternative splicing, a process by which different mRNAs are generated from a single gene. In some cases, the expression of a single gene can result in a large number of different mRNAs and, hence, a large number of different functioning proteins. For example, it has been shown that 64 different mRNA variants may be generated from a single gene. Alternative splicing is a very common regulatory mechanism. According to one estimate, at least 30% of the genes are alternatively spliced. Monitoring alternative splicing will therefore provide information for drug discovery, therapy monitoring, and diagnostics. Therefore, there is a great need in the art for methods for more efficiently determining alternatively spliced mRNA.

SUMMARY OF THE INVENTION

Accordingly, this invention provides methods, compositions, and computer software for analyzing sequence variations such as products of alternative splicing. These methods, compositions and computer software products of the invention are particularly useful for analyzing a large number of alternatively spliced mRNAs. In some embodiments, methods, compositions and computer software for making and using Exon Chips are provided. The Exon Chips of the invention are particularly useful for analyzing gene regulation by alternative splicing, alternative promoters, RNA editing, etc. However, the utility of the Exon Chips is not limited to analyzing gene regulation. These chips may in general be used to analyze the arrangement of sequence elements (e.g., exons). In addition to being able to identify the specific sequence arrangements in a biological sample, the exon chip probe arrays of the invention are also useful for quantifying the specific sequences. Such probe arrays may be used to better understand the expression of genes, particularly those genes that are regulated by alternative splicing, alternative promoters, RNA editing, etc.

In one aspect of the invention, a nucleic acid probe array comprising a set of probes for interrogating the joining sequence between a first sequence element and a second sequence element is provided. In some embodiments, the probes on the probe array are oligonucleotides. The first sequence element may be a first exon and the second sequence element may be a second exon. The joining sequence is the portion of the sequence neighboring the junction between the first and second sequence. If the sequence elements are exons, the joining sequence is the 3′ sequence of one exon and the 5′ sequence of another exon. The joining sequence should be at least 20 bases in length, preferably at least 30 bases in length, more preferably at least 40 bases in length, even more preferably at least 50 bases and most preferably 100 bases in length.

In some preferred embodiments, the set of probes are immobilized on a substrate at a density of at least 100 probes/cm², preferably at least 1000, more preferably at least 2000 probes/cm². The array may contain probes designed to quantify the sequence elements. For example, the array may contain probes targeting the internal sequence of exons. Optionally, control probes of various types may be included on the arrays of the invention.

In another aspect of the invention, a method for determining a target sequence, wherein said target sequence comprises a first sequence element joining a second sequence element, is provided. In some embodiments, the method involves hybridizing a target sequence with a nucleic acid probe array having a set of probes for interrogating the joining sequence between a first sequence element and a second sequence element, and obtaining information about the joining sequence based upon the hybridization of the target sequence with the set of probes. The first and second sequence elements may be exons. The set of nucleic acid probes may be oligonucleotide probes immobilized on a substrate, preferably at a density of at least 100 probes/cm². In some embodiments, the target sequence is an mRNA. The mRNA may be one of at least two alternatively spliced mRNAs transcribed from a gene. The method may also include the step of quantifying the first and second sequence elements using information about the joining sequence and said hybridization.

In some embodiments, the nucleic acid probe array of the invention may have additional sequence probes against the first and second sequence elements. The quantification may be based upon the hybridization of target sequence and sequence probes against the internal sequence of the first and second sequence elements. The probes for interrogating are probes for tiling the joining sequence, which should be at least 20 bases in length, preferably at least 30 bases, more preferably at least 40 bases, and even more preferably at least 50 bases and most preferably at least 100 bases.

In yet another aspect of the invention, a computer software product is provided. The product may include: a) computer code that receives a plurality of hybridization signals, wherein each of the plurality of signals reflects the hybridization of one of a plurality of tiling probes to interrogate the joining sequence of a target sequence, wherein the target sequence has at least one sequence element that is selected from a group of at least two sequence elements; b) computer code that identifies the sequence element based upon said hybridization signals; and c) a computer readable media that stores said codes. The tiling probes are oligonucleotides immobilized on a substrate. The tiling probes interrogate at least 20 bases, preferably at least 30 bases, more preferably at least 40 bases, even more preferably at least 50 bases and most preferably at least 100 bases. The computer software may include computer code for quantifying a target sequence.

In yet another aspect, methods for designing probes for detecting the combination of two sequence elements are provided. In some embodiments, the methods include inputting the sequence of the joining region between two sequence elements; and selecting probes for tiling the said joining region based upon the sequence of the joining region. In preferred embodiments, sequence elements are exons. In some embodiments, the method of the invention also includes a step of designing a lithographic mask, where the lithographic mask is used in the fabrication of arrays of nucleic acid probes. In some other embodiments, the method of the invention includes a step of outputting signals for controlling an ink-jet printing mechanism for depositing compounds on a substrate. The sequence of the joining region to be interrogated is at least 20 bases, preferably at least 30 bases, more preferably at least 40 bases, even more preferably at least 50 bases and most preferably at least 100 bases.

Computer software products for designing exon chips of the invention are also provided. In some embodiments, the computer software product includes computer program code that constructs a joining sequence; computer program code that selects tiling probes to interrogate the joining sequence; and a computer readable media that stores said codes. The joining sequence may be for one of alternatively spliced mRNAs. In some embodiments, the computer software product also includes computer code that inputs exon sequences. The joining sequence is constructed based upon the exon sequences. The computer software product may include code that outputs the sequence of the probes.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows alternative splicing.

FIG. 2 shows detection of combination of sequence elements.

FIG. 3 shows detection of alternative splicing.

FIG. 4 shows detection of more complex alternative splicing.

FIG. 5 shows the process for designing an exon chip.

FIG. 6 shows the process for analyzing data from an exon chip.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

An mRNA is often the result of the combination of sequence elements. For example, a mature mRNA may be the result of RNA splicing where sequences transcribed from introns are removed. The combination of the sequence elements may be configured in alternative formats. In some embodiments of the invention, methods, compositions, computer software products and systems are provided to identify the configuration (arrangement of sequence elements, such as exons) of nucleic acids. The methods, compositions, computer software products and systems are particularly useful for simultaneously quantifying and characterizing mRNAs.

I. Detecting Sequence Elements

Activity of a gene is reflected by the activity of its product(s): the proteins or other molecules encoded by the gene. Those product molecules perform biological functions. Directly measuring the activity of a gene product is, however, often difficult for certain genes. Instead, the immunological activities or the amount of the final product(s) or its peptide processing intermediates are determined as a measurement of the gene activity. More frequently, the amount or activity of intermediates, such as transcripts, RNA processing intermediates, or mature mRNAs, are detected as a measurement of gene activity. The term “mRNA” refers to transcripts of a gene. Transcripts are RNAs including, for example, mature messenger RNA ready for translation and products of various stages of transcript processing. Transcript processing may include splicing, editing and degradation.

In many cases, the form and function of the final product(s) of a gene are unknown. In those cases, the activity of a gene is measured conveniently by the amount or activity of transcript(s), RNA processing intermediate(s), mature mRNA(s) or its protein product(s).

A transcriptional unit is a continuous segment of DNA that is transcribed into RNA. For example, bacteria can continuously transcribe several contiguous genes to make polycistronic mRNAs. The contiguous genes are from the same transcriptional unit. It is well known in the art that higher organisms also use several mechanisms to make a variety of different gene products from a single transcriptional unit.

Many genes are known to have several alternative promoters, the use of each promoter resulting in one particular transcript. Generally, the use of a 5′ promoter results in a product that has additional sequence elements that are absent from the products resulting from relatively 3′ promoters. The use of alternative promoters is frequently employed to regulate tissue specific gene expression. For example, the human dystrophin gene has at least seven promoters. The most 5′ upstream promoter is used to transcribe a brain specific transcript; a promoter 100 kb downstream from the first promoter is used to transcribe a muscle specific transcript; and a promoter 100 kb downstream of the second promoter is used to transcribe a Purkinje cell specific transcript.

Similarly, alternative splicing is also an important mechanism for regulating gene activity, frequently in a tissue specific manner. In eukaryotes, nascent pre-mRNAs are generally not translated into proteins. Rather, they are processed in several ways to generate mature mRNAs. RNA splicing is the most common method of RNA processing. Nascent pre-mRNAs are cut and pasted by specialized apparatus called spliceosomes. Some non-coding regions transcribed from the intron regions are excised. Exons are linked to form a contiguous coding region ready for translation. In some splicing reactions, a single type of nascent pre-mRNA is used to generate multiple types of mature RNA by a process called alternative splicing, in which exons (sequence elements) are alternatively used to form different mature mRNAs which code for different proteins. For example, the human Calcitonin gene (CALC) is spliced as calcitonin, a circulating Ca²⁺ homeostatic hormone, in the thyroid; and as calcitonin gene-related peptide (CGRP), a neuromodulatory and trophic factor, in the hypothalamus (See, Hodges and Bernstein, 1994, Adv. Genet., 31, 207-281).

Alternative splicing is an important regulatory mechanism in higher eukaryotes (Sharp, P. A. (1994) Cell, 77, 805-815). By recent estimates, at least 30% of human genes are spliced alternatively (Mironov, A. A. and Gelfand, M. S. Proc. 1st Int. Conf. on Bioinformatics of Genome Regulation, 1998. vol. 2, p. 249). Alternative splicing plays a major role in sex determination in Drosophila, antibody response in humans and other tissue or developmental stage specific processes (Stamm, S., Zhang, M. Q., Marr, T. G. and Helfman, D. M., 1994, Nucleic Acids Res., 22, 1515-1526; Chabot, B., 1996, Trends Genet., 12, 472-478; Breitbart, R. E., Andreadis, A. and Nadal-Ginard, B., 1987, Annu. Rev. Biochem., 56, 467-495; Smith, C. W., Patton, J. G. and Nadal-Ginard, B., 1989, Annu. Rev. Genet., 23, 527-57). Alternative splicing can generate up to 64 different mRNA variants from a single transcript (Breitbart, R. E. and Nadal-Ginard, N. 1987, Cell, 46, 793-803). All cited references are incorporated herein by reference for all purposes.

High-density arrays are particularly useful for monitoring the expression control at the transcriptional, RNA processing and degradation level. The fabrication and application of high density arrays in gene expression monitoring have been disclosed previously in, for example, U.S. Pat. No. 6,040,138, incorporated herein by reference for all purposes. In some embodiments using high-density arrays, high-density oligonucleotide arrays are synthesized using methods such as the Very Large Scale Immobilized Polymer Synthesis (VLSIPS) disclosed in U.S. Pat. No. 5,445,934, incorporated herein for all purposes by reference. Each oligonucleotide occupies a known location on a substrate. A nucleic acid target sample is hybridized with a high-density array of oligonucleotides and then the amount of target nucleic acids hybridized to each probe in the array is quantified. One preferred quantifying method is to use a confocal microscope and fluorescent labels. The GeneChip® system (Affymetrix, Santa Clara, Calif.) is particularly suitable for quantifying the hybridization; however, it is apparent to those of skill in the art that any similar systems or other effectively equivalent detection methods can also be used.

High-density arrays are suitable for quantifying small variations in expression levels of a gene in the presence of a large population of heterogeneous nucleic acids. Such high-density arrays can be fabricated either by de novo synthesis on a substrate or by spotting or transporting natural nucleic acid sequences onto specific locations of a substrate. Nucleic acids are purified and/or isolated from biological materials, such as a bacterial plasmid containing a cloned segment of the sequence of interest.

Oligonucleotide arrays are particularly preferred for this invention. Oligonucleotide arrays have numerous advantages, as opposed to other methods, such as efficiency of production, reduced intra- and inter-array variability, increased information content and high signal-to-noise ratio.

Preferred high density arrays for gene function identification and genetic network mapping comprise greater than about 100, preferably greater than about 1000, more preferably greater than about 16,000 and most preferably greater than 65,000 or 250,000 or even greater than about 1,000,000 different oligonucleotide probes, preferably in less than 1 cm² of surface area. The oligonucleotide probes range from about 5 to about 50 or about 500 nucleotides, more preferably from about 10 to about 40 nucleotides and most preferably from about 15 to about 40 nucleotides in length.

Oligonucleotide probe arrays containing probes targeting exon sequences may be selected to detect and quantify various transcripts. By using these exon probes, the presence of particular exons in a biological sample may be determined. In the following sections, methods for designing probe arrays for detecting and quantifying target nucleic acids of specific configurations (arrangement of sequence elements) are provided.

II. Probes for Detecting Combination of Sequence Elements

In one aspect of the invention, nucleic acid probes are provided for determining and optionally quantifying the arrangement of sequence elements. These probes may preferably be immobilized on a substrate as a probe array.

In some embodiments of the invention, a probe set is designed to interrogate the sequence of the region that joins two sequence elements (see, FIG. 2). Once the sequence of the region joining two sequence elements is known, the combination of sequence elements can be ascertained. For example, as shown in FIG. 2, two sequence elements 1 and 2 may be alternatively used to form:

Configuration 1: Element 1-element 3

Configuration 2: Element 2-element 3

Probe sets for tiling the region joining elements 1 and 3 and the region joining elements 2 and 3 may be designed to determine the presence of configurations 1 and 2. Because the hybridization signals also reflect the levels of sequences, relative levels of configuration 1 and configuration 2 in a biological sample may also be determined. Methods for quantitatively determining the levels of a large number of mRNAs are disclosed in, for example, U.S. Pat. No. 6,040,138, incorporated herein by reference for all purposes.
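
As a minimal sketch of how the joining region for each configuration might be assembled in software, the Python below concatenates the 3′ end of one element with the 5′ end of the next. The function name `joining_sequence`, the `flank` parameter and the element sequences are illustrative assumptions and are not taken from the disclosure.

```python
def joining_sequence(upstream: str, downstream: str, flank: int = 10) -> str:
    """Return the last `flank` bases of `upstream` joined to the first
    `flank` bases of `downstream` (both written 5'->3')."""
    if flank < 1:
        raise ValueError("flank must be positive")
    return upstream[-flank:] + downstream[:flank]


# Hypothetical element sequences, for illustration only.
element_1 = "ATGGCCAAGTTGACC"
element_2 = "ATGCCTGGAACTTCA"
element_3 = "GGTACCGATTCAGGA"

# Configuration 1: element 1-element 3
config_1 = joining_sequence(element_1, element_3, flank=5)
# Configuration 2: element 2-element 3
config_2 = joining_sequence(element_2, element_3, flank=5)
```

Because the two joining regions share only their downstream half, probes that span the junction itself would be expected to distinguish the two configurations.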

In one embodiment (FIG. 3), probes may be designed to detect the transcripts of a target gene that has three exons (from 5′ to 3′, exon 1, exon 2 and exon 3). In this embodiment, a first set of probes is designed for tiling the 3′ region of exon 1 and the 5′ region of exon 2. A second set of probes is designed for tiling the 3′ region of exon 1 and the 5′ region of exon 3. A third set of probes is designed for tiling the 3′ region of exon 2 and the 5′ region of exon 3. The tiling region of the probe sets may be at least 10 bases, preferably at least 20 bases, and more preferably at least 40 bases. In some instances, the tiling region may be at least 100 bases.

FIG. 4 shows a gene that has four exons. Exon 1 may be spliced to join exon 2, 3 or 4. Exon 2 may be spliced to join exon 3 or 4. Exons 3 and 4 may be joined. Tiling probes (small bars under the exons) are designed to interrogate the joining sequences. Based upon the determined sequences, the various configurations may be ascertained.
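
The junctions listed for the three-exon and four-exon examples are every pairing of an upstream exon with a later exon. The short Python sketch below enumerates them; the function name `candidate_junctions` is hypothetical, and the sketch assumes exon order is preserved during splicing, as in the examples above.

```python
from itertools import combinations


def candidate_junctions(num_exons: int) -> list[tuple[int, int]]:
    """All (upstream, downstream) exon pairs with upstream < downstream."""
    return list(combinations(range(1, num_exons + 1), 2))


three_exon_junctions = candidate_junctions(3)
four_exon_junctions = candidate_junctions(4)
```

For n exons this gives n(n-1)/2 candidate joining regions, each of which would receive its own tiling probe set.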

Methods for designing probes for tiling a region for resequencing purposes were disclosed in, for example, U.S. Pat. No. 5,571,639 and Chee et al. 1996, Accessing Genetic Information with High-Density DNA Arrays, Science, 274: 610-614, both incorporated herein by reference for all purposes.

The methods of the invention have wide applications. For example, in some embodiments, the methods of the invention may be used to determine the relative levels of splice variants. By determining the relative levels of splice variants, the regulation of gene expression by alternative splicing may be understood, which may in turn provide information important for disease detection, drug discovery and monitoring of medical treatment.

The methods of the invention are not limited to the study of genes whose exon boundaries are completely known. On the contrary, because of the use of tiling probe sets, the methods of the invention tolerate some ambiguity in the knowledge of the exon boundary. The probe sets may be useful for determining the precise splice sites.

One of skill in the art would appreciate that the methods of the invention are not limited to the study of splice variants. Instead, the methods are generally applicable to the study of the arrangement of any nucleic acid sequence elements. For example, the methods are also useful for determining somatic recombination and RNA editing.

III. Methods, Systems and Computer Software for Designing Probes

Methods, systems and computer software for designing the probe sets are also provided. In some embodiments, the method for designing probes includes the step of obtaining sequence information of at least two sequence elements (such as two exons). The possible joining region between the two sequence elements is identified. Probes for tiling the region are selected.

In some other embodiments, the genomic DNA sequence of a gene is obtained. The intron-exon structure is predicted. Because of the limitations of some splice site prediction algorithms, the splice site may be somewhat ambiguously determined. Probes for tiling the joining regions between predicted exons are selected.

In some additional embodiments, the exon/intron boundary may be determined by comparing the sequence of transcripts and genomic sequences. Probes for tiling the regions joining two exons are selected.

FIG. 5 shows a process for computer assisted selection of probes. Exon sequences of one gene are input (501). The joining sequence(s) for one of the alternatively spliced mRNAs is constructed in a memory (502). The tiling probes to interrogate the sequence are selected (503). The process then continues to select tiling probes for another alternatively spliced mRNA until all mRNA variants from the gene are processed (504). The process then proceeds to input exon sequences of another gene (501).
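
The loop of FIG. 5 can be sketched in Python as follows. This is an illustrative outline only: the data layout (a dictionary of genes, each with named exon sequences and a list of variants given as ordered exon names), the function names and the default `flank` and `probe_len` values are all assumptions made for the example.

```python
def tile(region: str, probe_len: int) -> list[str]:
    """Every probe_len-base window of the region, stepping one base at a time."""
    return [region[i:i + probe_len] for i in range(len(region) - probe_len + 1)]


def design_probes(genes: dict, flank: int = 20, probe_len: int = 25) -> dict:
    """Map (gene, upstream exon, downstream exon) to target-sense tiling windows.

    `genes` maps a gene name to (exons, variants): `exons` maps exon names to
    sequences, and each variant is an ordered list of exon names.
    The probes synthesized on an array would be the complements of the windows.
    """
    design = {}
    for gene, (exons, variants) in genes.items():      # 501: input exon sequences
        for variant in variants:                       # 504: repeat per variant
            for up, down in zip(variant, variant[1:]):
                # 502: construct the joining sequence for this junction
                joining = exons[up][-flank:] + exons[down][:flank]
                # 503: select tiling windows that interrogate it
                design[(gene, up, down)] = tile(joining, probe_len)
    return design


# Hypothetical input: one gene with three exons and two splice variants.
example_genes = {
    "geneA": (
        {"e1": "AAAAACCCCC", "e2": "GGGGGTTTTT", "e3": "ACGTACGTAC"},
        [["e1", "e2", "e3"], ["e1", "e3"]],
    )
}
example_design = design_probes(example_genes, flank=3, probe_len=4)
```

Keying the result by junction means a junction shared by several variants is designed once.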

In some embodiments, a computerized system is used for forming and analyzing arrays of biological materials such as RNA or DNA. A digital computer is used to design arrays of biological polymers such as RNA or DNA. The computer may be, for example, an appropriately programmed Sun Workstation or Intel Pentium based personal computer or workstation, including appropriate memory, a CPU and other storage media such as a hard drive and, optionally, a CD-ROM or a Zip drive. The computer may be connected to a network such as a local area network and connected to a wide area network, such as the Internet, optionally via a proxy server. The computer's capability for accessing the Internet may be preferred in some embodiments wherein sequence databases may be accessed via the Internet.

The computer system obtains inputs from a user regarding desired characteristics of a gene of interest, and other inputs regarding the desired features of the array. Optionally, the computer system may obtain information regarding a specific genetic sequence of interest from an external or internal database such as GenBank (http://www.ncbi.nlm.nih.gov, last visited on Apr. 25, 2000). The output of the computer system is a set of chip design computer files.

The chip design files are provided to a system that designs the lithographic masks used in the fabrication of arrays of molecules such as DNA. The system or process may include the hardware necessary to manufacture masks and also the computer hardware and software necessary to lay the mask patterns out on the mask in an efficient manner. Such equipment may or may not be located at the same physical site. The system generates masks such as chrome-on-glass masks for use in the fabrication of polymer arrays.

The masks, as well as selected information relating to the design of the chips from a system, are used in a synthesis system. The synthesis system includes the necessary hardware and software used to fabricate arrays of polymers on a substrate or chip. For example, the synthesizer includes a light source and a chemical flow cell on which the substrate or chip is placed. A mask may be placed between the light source and the substrate/chip, and the two are translated relative to each other at appropriate times for deprotection of selected regions of the chip. Selected chemical reagents are directed through the flow cell for coupling to deprotected regions, as well as for washing and other operations. All operations are preferably directed by an appropriately programmed digital computer, which may or may not be the same computer as the computer(s) used in mask design and mask making.

The sequences of various probes to be synthesized on the chip are selected and the physical arrangement of the probes on the chip is determined. For example, the joining region of the target nucleic acid sequence of interest will be a k-mer, where k is preferably greater than 20, more preferably more than 40 and even more preferably more than 100, while the probes on the chip will be n-mers, where n is less than k. Accordingly, it will be necessary for the software to choose and locate the n-mers that will be synthesized on the chip such that the chip may be used to determine if a particular nucleic acid sample contains the joining region of the target nucleic acid.

In general, the tiling of a sequence will be performed by taking an n-base piece of the target, and determining the complement to that n-base piece. The system will then move down the target one position, and identify the complement to the next n-base piece. These n-base pieces will be the sequences placed on the chip when only the sequence is to be tiled.

As a simple example, suppose the target nucleic acid is 5′-ACGTTGCA-3′. Suppose that the chip will have 4-mers synthesized thereon. The 4-mer probes that will be complementary to the nucleic acid of interest will be 3′-TGCA (complement to the first four positions), 3′-GCAA (complement to positions 2, 3, 4 and 5), 3′-CAAC (complement to positions 3, 4, 5 and 6), 3′-AACG (complement to positions 4, 5, 6 and 7), and 3′-ACGT (complement to the last four positions). Accordingly, assuming the user has selected sequence tiling, the system determines that the sequences of the probes to be synthesized will be 3′-TGCA, 3′-GCAA, 3′-CAAC, 3′-AACG, and 3′-ACGT. If a particular sample has the target sequence, binding will be exhibited at the sites of each 4-mer probe. If a particular sample does not have the sequence 5′-ACGTTGCA-3′, little or no binding will be exhibited at the sites of one or more of the probes on the substrate.
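
The 4-mer example can be reproduced with a few lines of Python. This is a sketch for illustration: the function name `tiling_probes` is hypothetical, and each probe is returned written 3′ to 5′ so that it lines up base for base with the 5′ to 3′ target, matching the notation used above.

```python
# Watson-Crick complement table for DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tiling_probes(target: str, n: int) -> list[str]:
    """Complement of each n-base window of `target` (target written 5'->3').

    Each returned probe is written 3'->5', aligned with the target window.
    """
    return [
        target[i:i + n].translate(COMPLEMENT)
        for i in range(len(target) - n + 1)
    ]


probes = tiling_probes("ACGTTGCA", 4)
```

A target of length k tiled with n-mers at one-base steps yields k - n + 1 probes, five in this case.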

The system then determines if additional tiling is to be done and, if so, repeats.

After the probes have been selected, the system may minimize the number of synthesis cycles needed to form the array of probes. To perform this step, the probes that are to be synthesized are evaluated according to a specified algorithm to determine which bases are to be added in which order.

One algorithm uses a synthesis “template,” preferably a template that allows for minimization of the number of synthesis cycles needed to form the array of probes. One “template” is the repeated addition of ACGTACGT . . . All possible probes could be synthesized with a sufficiently long repetition of this template of synthesis cycles. By evaluating the probes against this (and/or other) templates, many steps may be deleted to generate various trial synthesis strategies. A trial synthesis strategy is tested by asking, for each base in the template, “can the probes be synthesized without this base addition?” In other words, a “trial strategy” can be used to synthesize the probes if every base in every probe may be synthesized in the proper order using some subset of the template. If so, this base addition is deleted from the template. Other bases are then tested for removal.
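
One way to read the template test is as a subsequence check: a trial strategy works if every probe, written in synthesis order, is a subsequence of the remaining base additions. The Python below is a simplified greedy sketch under that assumption; the function names are hypothetical, the result is not guaranteed to be the shortest possible strategy, and the sketch ignores the adjacency considerations discussed in the text.

```python
def is_subsequence(probe: str, cycles: str) -> bool:
    """True if the bases of `probe` occur in order within `cycles`."""
    remaining = iter(cycles)
    # `base in remaining` consumes the iterator up to the first match,
    # so successive bases must be found at increasing positions.
    return all(base in remaining for base in probe)


def prune_template(probes: list[str], template: str) -> str:
    """Greedily delete base additions that no probe needs."""
    cycles = template
    i = 0
    while i < len(cycles):
        trial = cycles[:i] + cycles[i + 1:]
        if all(is_subsequence(p, trial) for p in probes):
            cycles = trial      # this base addition can be dropped
        else:
            i += 1              # keep it and test the next addition
    return cycles


# Hypothetical probes, written in the order their bases are added.
example_probes = ["AC", "AG"]
strategy = prune_template(example_probes, "ACGT" * 2)
```

Here eight template additions reduce to three, because both probes can share the initial A addition.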

In the specific embodiment discussed below, a synthesis strategy is developed by one or a combination of several algorithms. This methodology may be designed to result in, for example, a small number of synthesis cycles and a small number of differences between adjacent probes on the chip. In one particular embodiment, this system will reduce the number of sequence step differences between adjacent probes in “columns” of a tiled sequence, i.e., it will reduce the number of times a monomer is added in one synthesis region when it is not added in an adjacent region. These are both desirable properties of a synthesis strategy.

IV. Methods, Systems and Computer Software for Detecting Combination of Sequence Elements

Methods, systems and computer software for detecting combination of sequence elements are provided. In some embodiments, a probe array is used to determine a target sequence that contains at least two sequence elements. At least one of the two sequence elements is selected from a group of at least two different sequence elements. In these embodiments, the probe array contains probes interrogating the sequence regions joining the two sequence elements. The exact arrangement of the sequence elements can be determined based upon the interrogation of the joining sequence region. In a sample containing two or more types of target sequences that have different combinations of sequence arrangement (such as alternatively spliced transcripts from one gene), the relative levels of the different types of target sequences may be determined based upon the hybridization intensity of interrogation probes. The term “quantifying” when used in the context of quantifying transcription levels of a gene can refer to absolute or to relative quantification. Absolute quantification may be accomplished by inclusion of known concentration(s) of one or more target nucleic acids (e.g., control nucleic acids such as Bio B or known amounts of the target nucleic acids themselves) and referencing the hybridization intensity of unknowns with the known target nucleic acids (e.g., through generation of a standard curve). Alternatively, relative quantification can be accomplished by comparison of hybridization signals between two or more genes, or between two or more treatments, to quantify the changes in hybridization intensity and, by implication, transcription level. Methods for quantitatively analyzing a target sequence using single or multiple probes on a substrate are described in, for example, U.S. Pat. No. 6,040,138, incorporated herein by reference for all purposes.
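
The two kinds of quantification can be illustrated with a short Python sketch. It rests on strong simplifying assumptions that are not part of the disclosure: signal is taken to be linear in target concentration, all probe sets are treated as having equal affinity, and the function names and intensity values are hypothetical.

```python
from statistics import mean


def relative_levels(intensities: dict[str, list[float]]) -> dict[str, float]:
    """Fraction of total mean junction-probe signal for each configuration."""
    means = {name: mean(values) for name, values in intensities.items()}
    total = sum(means.values())
    return {name: m / total for name, m in means.items()}


def fit_standard_curve(conc: list[float], signal: list[float]) -> tuple[float, float]:
    """Least-squares line signal = slope * conc + intercept from spiked controls."""
    mx, my = mean(conc), mean(signal)
    sxx = sum((x - mx) ** 2 for x in conc)
    sxy = sum((x - mx) * (y - my) for x, y in zip(conc, signal))
    slope = sxy / sxx
    return slope, my - slope * mx


def estimate_concentration(signal: float, slope: float, intercept: float) -> float:
    """Invert the standard curve for an unknown sample."""
    return (signal - intercept) / slope


# Hypothetical junction-probe intensities for two configurations.
levels = relative_levels({"config1": [300.0, 320.0, 280.0],
                          "config2": [100.0, 90.0, 110.0]})

# Hypothetical control spikes at known concentrations (arbitrary units).
slope, intercept = fit_standard_curve([1.0, 2.0, 4.0], [100.0, 200.0, 400.0])
unknown = estimate_concentration(300.0, slope, intercept)
```

In practice a standard curve may be non-linear near saturation or background, so a linear fit should be treated as a first approximation only.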

IV. Gene Expression Monitoring Methods

As discussed above, any methods that measure the activity of a gene are useful for at least some embodiments of this invention. For example, traditional Northern blotting and hybridization, nuclease protection, RT-PCR and differential display have been used for detecting gene activity. Those methods are useful for some embodiments of the invention. However, this invention is most useful in conjunction with methods for detecting the expression of a large number of genes.

High-density arrays are particularly useful for monitoring the expression control at the transcriptional, RNA processing and degradation level. The fabrication and application of high density arrays in gene expression monitoring have been disclosed previously in, for example, U.S. Pat. No. 5,800,992, issued Sep. 1, 1998, and U.S. application Ser. No. 08/772,376, filed Dec. 23, 1996, all incorporated herein for all purposes by reference. In some embodiments using high-density arrays, high-density oligonucleotide arrays are synthesized using methods such as the Very Large Scale Immobilized Polymer Synthesis (VLSIPS) disclosed in U.S. Pat. No. 5,445,934, incorporated herein for all purposes by reference. Each oligonucleotide occupies a known location on a substrate. A nucleic acid target sample is hybridized with a high-density array of oligonucleotides and then the amount of target nucleic acids hybridized to each probe in the array is quantified. One preferred quantifying method is to use a confocal microscope and fluorescent labels. The GeneChip® Probe Array system (Affymetrix, Santa Clara, Calif.) is particularly suitable for quantifying the hybridization; however, it is apparent to those of skill in the art that any similar systems or other effectively equivalent detection methods can also be used.

High-density arrays are suitable for quantifying small variations in expression levels of a gene in the presence of a large population of heterogeneous nucleic acids. Such high-density arrays can be fabricated either by de novo synthesis on a substrate or by spotting or transporting natural nucleic acid sequences onto specific locations of a substrate. Nucleic acids are purified and/or isolated from biological materials, such as a bacterial plasmid containing a cloned segment of a sequence of interest. Suitable nucleic acids are also produced by amplification of templates. As a nonlimiting illustration, polymerase chain reaction and/or in vitro transcription are suitable nucleic acid amplification methods.

Synthesized oligonucleotide arrays are particularly preferred for this invention. Oligonucleotide arrays have numerous advantages, as opposed to other methods, such as efficiency of production, reduced intra- and inter-array variability, increased information content and high signal to noise ratio.

Preferred high density arrays for gene function identification and genetic network mapping comprise greater than about 100, preferably greater than about 1000, more preferably greater than about 16,000 and most preferably greater than 65,000 or 250,000 or even greater than about 1,000,000 different oligonucleotide probes, preferably in less than 1 cm² of surface area. The oligonucleotide probes range from about 5 to about 50 or about 500 nucleotides, more preferably from about 10 to about 40 nucleotides and most preferably from about 15 to about 40 nucleotides in length.

A. Massive Parallel Gene Expression Monitoring

One preferred method for massive parallel gene expression monitoring is based upon high-density nucleic acid arrays.

Generally those methods of monitoring gene expression involve (a) providing a pool of target nucleic acids comprising RNA transcript(s) of one or more target gene(s), or nucleic acids derived from the RNA transcript(s); (b) hybridizing the nucleic acid sample to a high density array of probes; and (c) detecting the hybridized nucleic acids and calculating a relative and/or absolute expression (transcription, RNA processing or degradation) level.

1. Providing a Nucleic Acid Sample

One of skill in the art will appreciate that it is desirable to have nucleic acid samples containing target nucleic acid sequences that reflect the transcripts of interest. Therefore, suitable nucleic acid samples may contain transcripts of interest. Suitable nucleic acid samples, however, may also contain nucleic acids derived from the transcripts of interest. As used herein, a nucleic acid derived from a transcript refers to a nucleic acid for whose synthesis the mRNA transcript or a subsequence thereof has ultimately served as a template. Thus, a cDNA reverse transcribed from a transcript, an RNA transcribed from that cDNA, a DNA amplified from the cDNA, an RNA transcribed from the amplified DNA, etc., are all derived from the transcript, and detection of such derived products is indicative of the presence and/or abundance of the original transcript in a sample. Thus, suitable samples include, but are not limited to, transcripts of the gene or genes, cDNA reverse transcribed from the transcript, cRNA transcribed from the cDNA, DNA amplified from the genes, RNA transcribed from amplified DNA, and the like. Transcripts, as used herein, may include, but are not limited to, pre-mRNA nascent transcript(s), transcript processing intermediates, mature mRNA(s) and degradation products. It is not necessary to monitor all types of transcripts to practice this invention. For example, one may choose to practice the invention to measure the mature mRNA levels only.

In one embodiment, such a sample is a homogenate of cells or tissues or other biological samples. Preferably, such a sample is a total RNA preparation of a biological sample. More preferably in some embodiments, such a nucleic acid sample is the total mRNA isolated from a biological sample. Those of skill in the art will appreciate that the total mRNA prepared with most methods includes not only the mature mRNA, but also the RNA processing intermediates and nascent pre-mRNA transcripts. For example, total mRNA purified with a poly (T) column contains RNA molecules with poly (A) tails. Those poly A+ RNA molecules could be mature mRNA, RNA processing intermediates, nascent transcripts or degradation intermediates.

Biological samples may be of any biological tissue or fluid or cells. Frequently the sample will be a “clinical sample” which is a sample derived from a patient. Clinical samples provide a rich source of information regarding the various states of genetic network or gene expression. Some embodiments of the invention are employed to detect mutations and to identify the function of mutations. Such embodiments have extensive applications in clinical diagnostics and clinical studies. Typical clinical samples include, but are not limited to, sputum, blood, blood cells (e.g., white cells), tissue or fine needle biopsy samples, urine, peritoneal fluid, and pleural fluid, or cells therefrom. Biological samples may also include sections of tissues such as frozen sections taken for histological purposes.

Another typical source of biological samples is cell cultures, where gene expression states can be manipulated to explore the relationship among genes. In one aspect of the invention, methods are provided to generate biological samples reflecting a wide variety of states of the genetic network.

One of skill in the art would appreciate that it is desirable to inhibit or destroy RNase present in homogenates before the homogenates can be used for hybridization. Methods of inhibiting or destroying nucleases are well known in the art. In some preferred embodiments, cells or tissues are homogenized in the presence of chaotropic agents to inhibit nuclease. In some other embodiments, RNases are inhibited or destroyed by heat treatment followed by proteinase treatment.

Methods of isolating total mRNA are also well known to those of skill in the art. For example, methods of isolation and purification of nucleic acids are described in detail in Chapter 3 of Laboratory Techniques in Biochemistry and Molecular Biology: Hybridization With Nucleic Acid Probes, Part I. Theory and Nucleic Acid Preparation, P. Tijssen, ed. Elsevier, N.Y. (1993).

In a preferred embodiment, the total RNA is isolated from a given sample using, for example, an acid guanidinium-phenol-chloroform extraction method and polyA⁺ mRNA is isolated by oligo dT column chromatography or by using (dT)n magnetic beads (see, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual (2nd ed.), Vols. 1-3, Cold Spring Harbor Laboratory, (1989), or Current Protocols in Molecular Biology, F. Ausubel et al., ed. Greene Publishing and Wiley-Interscience, New York (1987)).

Frequently, it is desirable to amplify the nucleic acid sample prior to hybridization. One of skill in the art will appreciate that whatever amplification method is used, if a quantitative result is desired, care must be taken to use a method that maintains or controls the relative frequencies of the amplified nucleic acids to achieve quantitative amplification.

Methods of “quantitative” amplification are well known to those of skill in the art. For example, quantitative PCR involves simultaneously co-amplifying a known quantity of a control sequence using the same primers. This provides an internal standard that may be used to calibrate the PCR reaction. The high density array may then include probes specific to the internal standard for quantification of the amplified nucleic acid.

Other suitable amplification methods include, but are not limited to, polymerase chain reaction (PCR) (Innis, et al., PCR Protocols. A guide to Methods and Application. Academic Press, Inc. San Diego, (1990)), ligase chain reaction (LCR) (see Wu and Wallace, Genomics, 4: 560 (1989), Landegren, et al., Science, 241: 1077 (1988) and Barringer, et al., Gene, 89: 117 (1990)), transcription amplification (Kwoh, et al., Proc. Natl. Acad. Sci. USA, 86: 1173 (1989)), and self-sustained sequence replication (Guatelli, et al., Proc. Nat. Acad. Sci. USA, 87: 1874 (1990)).

Cell lysates or tissue homogenates often contain a number of inhibitors of polymerase activity. Therefore, RT-PCR typically incorporates preliminary steps to isolate total RNA or mRNA for subsequent use as an amplification template. A one-tube mRNA capture method may be used to prepare poly(A)+ RNA samples suitable for immediate RT-PCR in the same tube (Boehringer Mannheim). The captured mRNA can be directly subjected to RT-PCR by adding a reverse transcription mix and, subsequently, a PCR mix.

In a particularly preferred embodiment, the sample mRNA is reverse transcribed with a reverse transcriptase and a primer consisting of oligo dT and a sequence encoding the phage T7 promoter to provide a single stranded DNA template. The second DNA strand is polymerized using a DNA polymerase. After synthesis of double-stranded cDNA, T7 RNA polymerase is added and RNA is transcribed from the cDNA template. Successive rounds of transcription from each single cDNA template result in amplified RNA. Methods of in vitro polymerization are well known to those of skill in the art (see, e.g., Sambrook, supra.) and this particular method is described in detail by Van Gelder, et al., Proc. Natl. Acad. Sci. USA, 87: 1663-1667 (1990). Moreover, Eberwine et al. Proc. Natl. Acad. Sci. USA, 89: 3010-3014 provide a protocol that uses two rounds of amplification via in vitro transcription to achieve greater than 10⁶ fold amplification of the original starting material, thereby permitting expression monitoring even where biological samples are limited.

cRNA amplification methods are disclosed in U.S. Provisional Application No. 60/172,340, filed on Dec. 16, 1999.

It will be appreciated by one of skill in the art that the direct transcription method described above provides an antisense (aRNA) pool. Where antisense RNA is used as the target nucleic acid, the oligonucleotide probes provided in the array are chosen to be complementary to subsequences of the antisense nucleic acids. Conversely, where the target nucleic acid pool is a pool of sense nucleic acids, the oligonucleotide probes are selected to be complementary to subsequences of the sense nucleic acids. Finally, where the nucleic acid pool is double stranded, the probes may be of either sense, as the target nucleic acids include both sense and antisense strands.
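As a nonlimiting illustration of the strand relationship described above, a probe that is to hybridize to a given target is the reverse complement of the target subsequence. The sequences and function names below are hypothetical and a DNA alphabet is assumed throughout for simplicity.

```python
# Assumption: sequences are written 5' to 3' using the DNA alphabet.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the strand that base-pairs with `sequence`, written 5' to 3'."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def probe_for_target(target_subsequence):
    """A probe complementary to the target subsequence it must capture."""
    return reverse_complement(target_subsequence)


sense_fragment = "ATGGCCTTAG"  # hypothetical sense (mRNA-like) fragment
antisense_fragment = reverse_complement(sense_fragment)

# A probe against an antisense target has the sense sequence, and vice versa.
probe_vs_antisense = probe_for_target(antisense_fragment)
probe_vs_sense = probe_for_target(sense_fragment)
```

For a double stranded pool, either `probe_vs_antisense` or `probe_vs_sense` would find a complementary strand, which is why probes of either sense may be used in that case.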

The protocols cited above include methods of generating pools of either sense or antisense nucleic acids. Indeed, one approach can be used to generate either sense or antisense nucleic acids as desired. For example, the cDNA can be directionally cloned into a vector (e.g., Stratagene's pBluescript II KS (+) phagemid) such that it is flanked by the T3 and T7 promoters. In vitro transcription with the T3 polymerase will produce RNA of one sense (the sense depending on the orientation of the insert), while in vitro transcription with the T7 polymerase will produce RNA having the opposite sense. Other suitable cloning systems include phage lambda vectors designed for Cre-loxP plasmid subcloning (see e.g., Palazzolo et al., Gene, 88: 25-36 (1990)).

B. Hybridizing Nucleic Acids to High Density Array

1. Probe Design

One of skill in the art will appreciate that an enormous number of array designs are suitable for the practice of this invention. The high density array will typically include a number of probes that specifically hybridize to the sequences of interest. In addition, in a preferred embodiment, the array will include one or more control probes.

The high density array chip includes “test probes.” Test probes could be oligonucleotides that range from about 5 to about 45 or 5 to about 500 nucleotides, more preferably from about 10 to about 40 nucleotides and most preferably from about 15 to about 40 nucleotides in length. In other particularly preferred embodiments the probes are 20 or 25 nucleotides in length. In another preferred embodiment, test probes are double or single strand DNA sequences. DNA sequences are isolated or cloned from natural sources or amplified from natural sources using natural nucleic acid as templates. These probes have sequences complementary to particular subsequences of the genes whose expression they are designed to detect. Thus, the test probes are capable of specifically hybridizing to the target nucleic acid they are to detect.

In addition to test probes that bind the target nucleic acid(s) of interest, the high density array can contain a number of control probes. The control probes fall into three categories referred to herein as 1) Normalization controls; 2) Expression level controls; and 3) Mismatch controls, which are designed to contain at least one base that is different from that of a target sequence. Normalization controls are oligonucleotide or other nucleic acid probes that are complementary to labeled reference oligonucleotides or other nucleic acid sequences that are added to the nucleic acid sample. The signals obtained from the normalization controls after hybridization provide a control for variations in hybridization conditions, label intensity, “reading” efficiency and other factors that may cause the signal of a perfect hybridization to vary between arrays. In a preferred embodiment, signals (e.g., fluorescence intensity) read from all other probes in the array are divided by the signal (e.g., fluorescence intensity) from the control probes, thereby normalizing the measurements.
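As a nonlimiting illustration, the normalization described above may be sketched as dividing each test probe signal by the signal of the normalization controls. Averaging several control probes into one divisor, the probe identifiers, and the intensity values are assumptions made for this sketch.

```python
def normalize_signals(signals, control_ids):
    """Divide every non-control probe signal by the mean control signal.

    signals     -- mapping of probe identifier to raw intensity
    control_ids -- identifiers of the normalization control probes
    """
    control_values = [signals[probe_id] for probe_id in control_ids]
    control_mean = sum(control_values) / len(control_values)
    if control_mean <= 0:
        raise ValueError("normalization control signal must be positive")
    return {
        probe_id: value / control_mean
        for probe_id, value in signals.items()
        if probe_id not in control_ids
    }


# Hypothetical raw fluorescence intensities from one array.
raw = {"norm_1": 200.0, "norm_2": 400.0, "gene_A": 600.0, "gene_B": 150.0}
normalized = normalize_signals(raw, {"norm_1", "norm_2"})
```

Because every array is scaled by its own control signal, normalized values from different arrays can be compared even when overall label intensity or reading efficiency differs.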

Virtually any probe may serve as a normalization control. However, it is recognized that hybridization efficiency varies with base composition and probe length. Preferred normalization probes are selected to reflect the average length of the other probes present in the array; however, they can be selected to cover a range of lengths. The normalization control(s) can also be selected to reflect the (average) base composition of the other probes in the array. In a preferred embodiment, however, only one or a few normalization probes are used and they are selected such that they hybridize well (i.e., no secondary structure) and do not match any target-specific probes.

Expression level controls are probes that hybridize specifically with constitutively expressed genes in the biological sample. Virtually any constitutively expressed gene provides a suitable target for expression level controls. Typically expression level control probes have sequences complementary to subsequences of constitutively expressed “housekeeping genes” including, but not limited to, the β-actin gene, the transferrin receptor gene, the GAPDH gene, and the like. Mismatch controls may also be provided for the probes to the target genes, for expression level controls or for normalization controls. Mismatch controls are oligonucleotide probes or other nucleic acid probes designed to be identical to their corresponding test, target or control probes except for the presence of one or more mismatched bases. A mismatched base is a base selected so that it is not complementary to the corresponding base in the target sequence to which the probe would otherwise specifically hybridize. One or more mismatches are selected such that under appropriate hybridization conditions (e.g., stringent conditions) the test or control probe would be expected to hybridize with its target sequence, but the mismatch probe would not hybridize (or would hybridize to a significantly lesser extent). Preferred mismatch probes contain a central mismatch. Thus, for example, where a probe is a 20 mer, a corresponding mismatch probe will have the identical sequence except for a single base mismatch (e.g., substituting a G, a C or a T for an A) at any of positions 6 through 14 (the central mismatch).
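As a nonlimiting illustration, a mismatch control with a single central mismatch may be derived from a perfect match probe as sketched below. Defaulting to the middle position and to the complementary base as the substitute are assumptions of this sketch; the text above requires only that the substituted base differ from the original and lie at a central position.

```python
# Assumed default substitution: swap each base for its complement.
_DEFAULT_SUBSTITUTE = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(perfect_match, position=None, replacement=None):
    """Return a copy of `perfect_match` with one base changed.

    position    -- 1-based position of the mismatch; defaults to the middle
                   of the probe (position 10 for a 20 mer)
    replacement -- substitute base; defaults to the complement of the original
    """
    length = len(perfect_match)
    if position is None:
        position = (length + 1) // 2
    if not 1 <= position <= length:
        raise ValueError("position is outside the probe")
    original = perfect_match[position - 1]
    if replacement is None:
        replacement = _DEFAULT_SUBSTITUTE[original]
    if replacement == original:
        raise ValueError("replacement must differ from the original base")
    return perfect_match[: position - 1] + replacement + perfect_match[position:]


pm_probe = "ACGTACGTACGTACGTACGT"  # hypothetical 20 mer
mm_probe = mismatch_probe(pm_probe)
```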

Mismatch probes thus provide a control for non-specific binding or cross-hybridization to a nucleic acid in the sample other than the target to which the probe is directed. Mismatch probes thus indicate whether a hybridization is specific or not. For example, if the target is present the perfect match probes should be consistently brighter than the mismatch probes. In addition, if all central mismatches are present, the mismatch probes can be used to detect a mutation. The difference in intensity between the perfect match and the mismatch probe (I(PM)-I(MM)) provides a good measure of the concentration of the hybridized material.
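As a nonlimiting illustration, the I(PM)-I(MM) difference may be summarized over the probe pairs directed to one target as sketched below. Averaging the differences and reporting the fraction of pairs in which the perfect match is brighter are assumptions of this sketch, as are the intensity values.

```python
def average_difference(probe_pairs):
    """Mean of I(PM) - I(MM) over all (pm, mm) intensity pairs."""
    differences = [pm - mm for pm, mm in probe_pairs]
    return sum(differences) / len(differences)


def fraction_pm_brighter(probe_pairs):
    """Share of pairs in which the perfect match outshines its mismatch."""
    brighter = sum(1 for pm, mm in probe_pairs if pm > mm)
    return brighter / len(probe_pairs)


# Hypothetical (perfect match, mismatch) intensities for one transcript.
pairs = [(1200.0, 300.0), (900.0, 250.0), (1500.0, 400.0), (200.0, 260.0)]
mean_diff = average_difference(pairs)
consistency = fraction_pm_brighter(pairs)
```

A high value of `consistency` together with a positive `mean_diff` would be in line with specific hybridization of the target; a value near one half would suggest that the signal is largely non-specific.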

The high density array may also include sample preparation/amplification control probes. These are probes that are complementary to subsequences of control genes selected because they do not normally occur in the nucleic acids of the particular biological sample being assayed. Suitable sample preparation/amplification control probes include, for example, probes to bacterial genes (e.g., Bio B) where the sample in question is a biological sample from a eukaryote.

The RNA sample is then spiked with a known amount of the nucleic acid to which the sample preparation/amplification control probe is directed before processing. Quantification of the hybridization of the sample preparation/amplification control probe then provides a measure of alteration in the abundance of the nucleic acids caused by processing steps (e.g., PCR, reverse transcription, in vitro transcription, etc.).

In a preferred embodiment, oligonucleotide probes in the high density array are selected to bind specifically to the nucleic acid target to which they are directed with minimal non-specific binding or cross-hybridization under the particular hybridization conditions utilized. Because the high density arrays of this invention can contain in excess of 1,000,000 different probes, it is possible to provide every probe of a characteristic length that binds to a particular nucleic acid sequence. Thus, for example, the high density array can contain every possible 20 mer sequence complementary to an IL-2 mRNA.

There may, however, exist 20 mer subsequences that are not unique to the IL-2 mRNA. Probes directed to these subsequences are expected to cross hybridize with occurrences of their complementary sequence in other regions of the sample genome. Similarly, other probes simply may not hybridize effectively under the hybridization conditions (e.g., due to secondary structure, or interactions with the substrate or other probes). Thus, in a preferred embodiment, the probes that show such poor specificity or hybridization efficiency are identified and may not be included either in the high density array itself (e.g., during fabrication of the array) or in the post-hybridization data analysis.
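As a nonlimiting illustration, tiling a target with every probe of a characteristic length and then discarding probes whose target subsequence also occurs elsewhere may be sketched as follows. Exact string matching against a list of background sequences is a simplifying assumption of this sketch; it does not model secondary structure, partial matches, or hybridization efficiency. The sequences are hypothetical and shortened to keep the example readable.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Strand that base-pairs with `sequence`, written 5' to 3'."""
    return sequence.translate(_COMPLEMENT)[::-1]


def tile(target, length):
    """Every (start, subsequence) window of the given length in the target."""
    return [
        (start, target[start:start + length])
        for start in range(len(target) - length + 1)
    ]


def select_unique_probes(target, background, length):
    """Probes complementary to windows that occur in no background sequence."""
    selected = []
    for start, window in tile(target, length):
        if any(window in other for other in background):
            continue  # expected to cross hybridize; leave it off the array
        selected.append((start, reverse_complement(window)))
    return selected


target_sequence = "ATGCGTAC"        # hypothetical target, written as DNA
other_transcripts = ["TTGCGTTT"]    # hypothetical background sequences
probes = select_unique_probes(target_sequence, other_transcripts, 4)
```

The same filter could instead be applied after hybridization by ignoring the flagged probes during data analysis, which corresponds to the second alternative mentioned above.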

In addition, in a preferred embodiment, expression monitoring arrays are used to identify the presence and expression (transcription) level of genes which are several hundred base pairs long. For most applications it would be useful to identify the presence, absence, or expression level of several thousand to one hundred thousand genes. Because the number of oligonucleotides per array is limited in a preferred embodiment, it is desired to include only a limited set of probes specific to each gene whose expression is to be detected.

As disclosed in U.S. application Ser. No. 08/772,376, probes as short as 15, 20, or 25 nucleotides are sufficient to hybridize to a subsequence of a gene and, for most genes, there is a set of probes that performs well across a wide range of target nucleic acid concentrations. In a preferred embodiment, it is desirable to choose a preferred or “optimum” subset of probes for each gene before synthesizing the high density array.

2. Forming High Density Arrays.

Methods of forming high density arrays of oligonucleotides, peptides and other polymer sequences with a minimal number of synthetic steps are known. The oligonucleotide analogue array can be synthesized on a solid substrate by a variety of methods, including, but not limited to, light-directed chemical coupling and mechanically directed coupling. See Pirrung et al., U.S. Pat. No. 5,143,854 (see also PCT Application No. WO 90/15070) and Fodor et al., PCT Publication Nos. WO 92/10092 and WO 93/09668 and U.S. Ser. No. 07/980,523, which disclose methods of forming vast arrays of peptides, oligonucleotides and other molecules using, for example, light-directed synthesis techniques. See also, Fodor et al., Science, 251, 767-77 (1991). These procedures for synthesis of polymer arrays are now referred to as VLSIPS™ procedures. Using the VLSIPS™ approach, one heterogeneous array of polymers is converted, through simultaneous coupling at a number of reaction sites, into a different heterogeneous array. See, U.S. application Ser. Nos. 07/796,243 and 07/980,523.

The development of VLSIPS™ technology as described in the above-noted U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, is considered pioneering technology in the fields of combinatorial synthesis and screening of combinatorial libraries. More recently, patent application Ser. No. 08/082,937, filed Jun. 25, 1993, describes methods for making arrays of oligonucleotide probes that can be used to check or determine a partial or complete sequence of a target nucleic acid and to detect the presence of a nucleic acid containing a specific oligonucleotide sequence.

In brief, the light-directed combinatorial synthesis of oligonucleotide arrays on a glass surface proceeds using automated phosphoramidite chemistry and chip masking techniques. In one specific implementation, a glass surface is derivatized with a silane reagent containing a functional group, e.g., a hydroxyl or amine group blocked by a photolabile protecting group. Photolysis through a photolithographic mask is used selectively to expose functional groups which are then ready to react with incoming 5′-photoprotected nucleoside phosphoramidites. The phosphoramidites react only with those sites which are illuminated (and thus exposed by removal of the photolabile blocking group). Thus, the phosphoramidites only add to those areas selectively exposed from the preceding step. These steps are repeated until the desired array of sequences has been synthesized on the solid surface. Combinatorial synthesis of different oligonucleotide analogues at different locations on the array is determined by the pattern of illumination during synthesis and the order of addition of coupling reagents.
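As a nonlimiting illustration, the way in which the pattern of illumination and the order of addition of coupling reagents determine the sequence at each location may be sketched as follows. The data structures and the three-step mask schedule are hypothetical; the sketch models only which sites are deprotected at each step and does not model the chemistry or coupling yield.

```python
def light_directed_synthesis(site_count, steps):
    """Simulate masked coupling cycles on a linear row of sites.

    steps -- ordered list of (monomer, illuminated_sites); in each cycle the
             monomer couples only at the sites exposed through the mask
    Returns the monomers at each site in order of addition.
    """
    array = [""] * site_count
    for monomer, illuminated_sites in steps:
        for site in illuminated_sites:
            array[site] += monomer  # deprotected site reacts with the monomer
    return array


# Hypothetical mask schedule for four sites and three coupling cycles.
schedule = [
    ("A", {0, 1}),
    ("C", {1, 2}),
    ("G", {0, 3}),
]
synthesized = light_directed_synthesis(4, schedule)
```

In this sketch three coupling cycles yield four different sequences, which reflects the combinatorial character of the masking strategy.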

In the event that an oligonucleotide analogue with a polyamide backbone is used in the VLSIPS™ procedure, it is generally inappropriate to use phosphoramidite chemistry to perform the synthetic steps, since the monomers do not attach to one another via a phosphate linkage. Instead, peptide synthetic methods are substituted. See, e.g., Pirrung et al. U.S. Pat. No. 5,143,854.

Peptide nucleic acids, which comprise a polyamide backbone and the bases found in naturally occurring nucleosides, are commercially available from, e.g., Biosearch, Inc. (Bedford, Mass.). Peptide nucleic acids are capable of binding to nucleic acids with high specificity, and are considered “oligonucleotide analogues” for purposes of this disclosure. In addition to the foregoing, additional methods which can be used to generate an array of oligonucleotides on a single substrate are described in co-pending application Ser. No. 07/980,523, filed Nov. 20, 1992, and Ser. No. 07/796,243, filed Nov. 22, 1991 and in PCT Publication No. WO 93/09668. In the methods disclosed in these applications, reagents are delivered to the substrate by either (1) flowing within a channel defined on predefined regions or (2) “spotting” on predefined regions or (3) through the use of photoresist. However, other approaches, as well as combinations of spotting and flowing, may be employed. In each instance, certain activated regions of the substrate are mechanically separated from other regions when the monomer solutions are delivered to the various reaction sites.

A typical “flow channel” method applied to the compounds and libraries of the present invention can generally be described as follows. Diverse polymer sequences are synthesized at selected regions of a substrate or solid support by forming flow channels on a surface of the substrate through which appropriate reagents flow or in which appropriate reagents are placed. For example, assume a monomer “A” is to be bound to the substrate in a first group of selected regions. If necessary, all or part of the surface of the substrate in all or a part of the selected regions is activated for binding by, for example, flowing appropriate reagents through all or some of the channels, or by washing the entire substrate with appropriate reagents. After placement of a channel block on the surface of the substrate, a reagent having the monomer A flows through or is placed in all or some of the channel(s). The channels provide fluid contact to the first selected regions, thereby binding the monomer A on the substrate directly or indirectly (via a spacer) in the first selected regions.

Thereafter, a monomer B is coupled to second selected regions, some of which may be included among the first selected regions. The second selected regions will be in fluid contact with a second flow channel(s) through translation, rotation, or replacement of the channel block on the surface of the substrate; through opening or closing a selected valve; or through deposition of a layer of chemical or photoresist. If necessary, a step is performed for activating at least the second regions. Thereafter, the monomer B is flowed through or placed in the second flow channel(s), binding monomer B at the second selected locations. In this particular example, the resulting sequences bound to the substrate at this stage of processing will be, for example, A, B, and AB. The process is repeated to form a vast array of sequences of desired length at known locations on the substrate.
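As a nonlimiting illustration, the A, B, and AB outcome described above may be reproduced by modeling the substrate as a small grid, flowing monomer A through a channel running along one row, and then, after rotating the channel block, flowing monomer B through a channel running along one column. The grid size, the channel choices, and the function name are assumptions of this sketch.

```python
def flow_monomer(grid, monomer, axis, channel_indices):
    """Couple `monomer` at every site in fluid contact with a selected channel.

    axis            -- "row" when channels run along rows, "col" after the
                       channel block has been rotated to run along columns
    channel_indices -- which rows (or columns) carry the reagent
    """
    for row in range(len(grid)):
        for col in range(len(grid[row])):
            in_channel = row in channel_indices if axis == "row" else col in channel_indices
            if in_channel:
                grid[row][col] += monomer
    return grid


# Hypothetical 2 x 2 substrate; each entry holds the sequence built so far.
substrate = [["", ""], ["", ""]]
flow_monomer(substrate, "A", "row", {0})   # first selected regions
flow_monomer(substrate, "B", "col", {0})   # second regions, block rotated
```

After the two steps the site common to both channels carries AB, the remaining sites of the row and column carry A and B respectively, and the untouched site stays empty.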

After the substrate is activated, monomer A can be flowed through some of the channels, monomer B can be flowed through other channels, a monomer C can be flowed through still other channels, etc. In this manner, many or all of the reaction regions are reacted with a monomer before the channel block must be moved or the substrate must be washed and/or reactivated. By making use of many or all of the available reaction regions simultaneously, the number of washing and activation steps can be minimized.

One of skill in the art will recognize that there are alternative methods of forming channels or otherwise protecting a portion of the surface of the substrate. For example, according to some embodiments, a protective coating such as a hydrophilic or hydrophobic coating (depending upon the nature of the solvent) is utilized over portions of the substrate to be protected, sometimes in combination with materials that facilitate wetting by the reactant solution in other regions. In this manner, the flowing solutions are further prevented from passing outside of their designated flow paths.

High density nucleic acid arrays can be fabricated by depositing presynthesized or natural nucleic acids in predefined positions. As disclosed in the U.S. Application Ser. No. and its parent applications, previously incorporated for all purposes, synthesized or natural nucleic acids are deposited on specific locations of a substrate by light directed targeting and oligonucleotide directed targeting. Nucleic acids can also be directed to specific locations in much the same manner as the flow channel methods. For example, a nucleic acid A can be delivered to and coupled with a first group of reaction regions which have been appropriately activated. Thereafter, a nucleic acid B can be delivered to and reacted with a second group of activated reaction regions. Nucleic acids are deposited in selected regions. Another embodiment uses a dispenser that moves from region to region to deposit nucleic acids in specific spots. Typical dispensers include a micropipette or capillary pin to deliver nucleic acid to the substrate and a robotic system to control the position of the micropipette with respect to the substrate. In other embodiments, the dispenser includes a series of tubes, a manifold, an array of pipettes or capillary pins, or the like so that various reagents can be delivered to the reaction regions simultaneously.

3. Hybridization

Nucleic acid hybridization simply involves contacting a probe and target nucleic acid under conditions where the probe and its complementary target can form stable hybrid duplexes through complementary base pairing. The nucleic acids that do not form hybrid duplexes are then washed away leaving the hybridized nucleic acids to be detected, typically through detection of an attached detectable label. It is generally recognized that nucleic acids are denatured by increasing the temperature or decreasing the salt concentration of the buffer containing the nucleic acids. Under low stringency conditions (e.g., low temperature and/or high salt) hybrid duplexes (e.g., DNA:DNA, RNA:RNA, or RNA:DNA) will form even where the annealed sequences are not perfectly complementary. Thus specificity of hybridization is reduced at lower stringency. Conversely, at higher stringency (e.g., higher temperature or lower salt) successful hybridization requires fewer mismatches.

One of skill in the art will appreciate that hybridization conditions may be selected to provide any degree of stringency. In a preferred embodiment, hybridization is performed at low stringency, in this case in 6×SSPE-T (0.005% Triton X-100) at 37° C., to ensure hybridization, and then subsequent washes are performed at higher stringency (e.g., 1×SSPE-T at 37° C.) to eliminate mismatched hybrid duplexes. Successive washes may be performed at increasingly higher stringency (e.g., down to as low as 0.25×SSPE-T at 37° C. to 50° C.) until a desired level of hybridization specificity is obtained. Stringency can also be increased by the addition of agents such as formamide. Hybridization specificity may be evaluated by comparison of hybridization to the test probes with hybridization to the various controls that can be present (e.g., expression level control, normalization control, mismatch controls, etc.). In general, there is a tradeoff between hybridization specificity (stringency) and signal intensity. Thus, in a preferred embodiment, the wash is performed at the highest stringency that produces consistent results and that provides a signal intensity greater than approximately 10% of the background intensity. Thus, in a preferred embodiment, the hybridized array may be washed at successively higher stringency solutions and read between each wash. Analysis of the data sets thus produced will reveal a wash stringency above which the hybridization pattern is not appreciably altered and which provides adequate signal for the particular oligonucleotide probes of interest.
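As a nonlimiting illustration, the analysis of arrays read between successively more stringent washes may be sketched as a search for the first wash after which the hybridization pattern no longer changes appreciably. Comparing patterns after scaling each reading to a total of one, the 0.02 tolerance, and the intensity values are all assumptions of this sketch; the signal intensity criterion mentioned above is not modeled.

```python
def normalized_pattern(intensities):
    """Scale a reading so that its intensities sum to one."""
    total = sum(intensities)
    return [value / total for value in intensities]


def first_stable_wash(readings, tolerance=0.02):
    """Label of the first wash whose pattern matches that of the next wash.

    readings -- list of (label, intensities), ordered from lowest to highest
                stringency, with probes in the same order in every reading
    Returns None when no two consecutive readings agree within the tolerance.
    """
    for (label, current), (_, following) in zip(readings, readings[1:]):
        before = normalized_pattern(current)
        after = normalized_pattern(following)
        largest_shift = max(abs(a - b) for a, b in zip(before, after))
        if largest_shift < tolerance:
            return label
    return None


# Hypothetical readings of three probes after four successive washes.
wash_series = [
    ("6x", [1000.0, 800.0, 600.0]),
    ("1x", [900.0, 300.0, 100.0]),
    ("0.5x", [850.0, 280.0, 95.0]),
    ("0.25x", [800.0, 265.0, 90.0]),
]
chosen_wash = first_stable_wash(wash_series)
```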

In a preferred embodiment, background signal is reduced by the use of a detergent (e.g., C-TAB) or a blocking reagent (e.g., sperm DNA, cot-1 DNA, etc.) during the hybridization to reduce non-specific binding. In a particularly preferred embodiment, the hybridization is performed in the presence of about 0.5 mg/ml DNA (e.g., herring sperm DNA). The use of blocking agents in hybridization is well known to those of skill in the art (see, e.g., Chapter 8 in P. Tijssen, supra.).

The stability of duplexes formed between RNAs or DNAs is generally in the order of RNA:RNA>RNA:DNA>DNA:DNA, in solution. Long probes have better duplex stability with a target, but poorer mismatch discrimination than shorter probes (mismatch discrimination refers to the measured hybridization signal ratio between a perfect match probe and a single base mismatch probe). Shorter probes (e.g., 8-mers) discriminate mismatches very well, but the overall duplex stability is low.

Altering the thermal stability (T_m) of the duplex formed between the target and the probe using, e.g., known oligonucleotide analogues allows for optimization of duplex stability and mismatch discrimination. One useful aspect of altering the T_m arises from the fact that adenine-thymine (A-T) duplexes have a lower T_m than guanine-cytosine (G-C) duplexes, due in part to the fact that the A-T duplexes have 2 hydrogen bonds per base pair, while the G-C duplexes have 3 hydrogen bonds per base pair. In heterogeneous oligonucleotide arrays in which there is a non-uniform distribution of bases, it is not generally possible to optimize hybridization for each oligonucleotide probe simultaneously. Thus, in some embodiments, it is desirable to selectively destabilize G-C duplexes and/or to increase the stability of A-T duplexes. This can be accomplished, e.g., by substituting guanine residues in the probes of an array which form G-C duplexes with hypoxanthine, or by substituting adenine residues in probes which form A-T duplexes with 2,6 diaminopurine, or by using the salt tetramethyl ammonium chloride (TMACl) in place of NaCl.
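As a nonlimiting illustration of why probes of equal length but different base composition cannot all be optimized simultaneously, the spread in T_m across an array may be estimated with a common rule of thumb for short oligonucleotides (often called the Wallace rule), which assigns about 2° C. per A or T and about 4° C. per G or C. This rule is not part of this disclosure, it ignores salt and probe concentration, and the probe sequences below are hypothetical.

```python
def wallace_tm(probe):
    """Rough melting temperature estimate in degrees C for a short DNA probe.

    Assumption: Tm = 2 * (A + T) + 4 * (G + C), a rule of thumb only.
    """
    probe = probe.upper()
    at_count = probe.count("A") + probe.count("T")
    gc_count = probe.count("G") + probe.count("C")
    return 2 * at_count + 4 * gc_count


# Hypothetical 20 mers at the two extremes of base composition.
at_rich_probe = "ATATATATATATATATATAT"
gc_rich_probe = "GCGCGCGCGCGCGCGCGCGC"
tm_spread = wallace_tm(gc_rich_probe) - wallace_tm(at_rich_probe)
```

Destabilizing G-C duplexes or stabilizing A-T duplexes, as described above, would narrow a spread of this kind so that one set of hybridization conditions suits more of the probes.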

Altered duplex stability conferred by using oligonucleotide analogue probes can be ascertained by following, e.g., fluorescence signal intensity of oligonucleotide analogue arrays hybridized with a target oligonucleotide over time. The data allow optimization of specific hybridization conditions at, e.g., room temperature (for simplified diagnostic applications in the future).

Another way of verifying altered duplex stability is by following the signal intensity generated upon hybridization with time. Previous experiments using DNA targets and DNA chips have shown that signal intensity increases with time, and that the more stable duplexes generate higher signal intensities faster than less stable duplexes. The signals reach a plateau or “saturate” after a certain amount of time due to all of the binding sites becoming occupied. These data allow for optimization of hybridization, and determination of the best conditions at a specified temperature.
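
One simple way to compare such time courses is to record when each signal first comes within a chosen fraction of its final (plateau) value. The sketch below is model-free and makes no kinetic assumption; the function name, the 95% cutoff, and the time-course values are hypothetical and serve only as an illustration.

```python
def time_to_plateau(times, signals, fraction=0.95):
    """Return the first time point at which the signal reaches
    `fraction` of its final measured value."""
    if len(times) != len(signals) or not times:
        raise ValueError("times and signals must be non-empty and equal in length")
    target = fraction * signals[-1]
    for t, s in zip(times, signals):
        if s >= target:
            return t
    return times[-1]


# Hypothetical time courses (hours, arbitrary intensity units).
hours = [0, 1, 2, 4, 8, 16]
more_stable = [0, 400, 700, 960, 1000, 1000]
less_stable = [0, 100, 200, 350, 480, 500]

t_more = time_to_plateau(hours, more_stable)
t_less = time_to_plateau(hours, less_stable)
print(t_more, t_less)
```

In this illustration the more stable duplex reaches both a higher plateau and reaches it sooner, which is the pattern described above.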

Methods of optimizing hybridization conditions are well known to those of skill in the art (see, e.g., Laboratory Techniques in Biochemistry and Molecular Biology, Vol. 24: Hybridization With Nucleic Acid Probes, P. Tijssen, ed., Elsevier, N.Y. (1993)).

C. Signal Detection

In a preferred embodiment, the hybridized nucleic acids are detected by detecting one or more labels attached to the sample nucleic acids. The labels may be incorporated by any of a number of means well known to those of skill in the art. However, in a preferred embodiment, the label is simultaneously incorporated during the amplification step in the preparation of the sample nucleic acids. Thus, for example, polymerase chain reaction (PCR) with labeled primers or labeled nucleotides will provide a labeled amplification product. In a preferred embodiment, transcription amplification, as described above, using a labeled nucleotide (e.g., fluorescein-labeled UTP and/or CTP) incorporates a label into the transcribed nucleic acids.

Alternatively, a label may be added directly to the original nucleic acid sample (e.g., mRNA, polyA mRNA, cDNA, etc.) or to the amplification product after the amplification is completed. Means of attaching labels to nucleic acids are well known to those of skill in the art and include, for example, nick translation or end-labeling (e.g., with a labeled RNA) by kinasing of the nucleic acid and subsequent attachment (ligation) of a nucleic acid linker joining the sample nucleic acid to a label (e.g., a fluorophore).

Detectable labels suitable for use in the present invention include any composition detectable by spectroscopic, photochemical, biochemical, immunochemical, electrical, optical or chemical means. Useful labels in the present invention include biotin for staining with labeled streptavidin conjugate, magnetic beads (e.g., Dynabeads™), fluorescent dyes (e.g., fluorescein, Texas red, rhodamine, green fluorescent protein, and the like), radiolabels (e.g., ³H, ¹²⁵I, ³⁵S, ¹⁴C, or ³²P), enzymes (e.g., horseradish peroxidase, alkaline phosphatase and others commonly used in an ELISA), and colorimetric labels such as colloidal gold or colored glass or plastic (e.g., polystyrene, polypropylene, latex, etc.) beads. Patents teaching the use of such labels include U.S. Pat. Nos. 3,817,837; 3,850,752; 3,939,350; 3,996,345; 4,277,437; 4,275,149; and 4,366,241.

Means of detecting such labels are well known to those of skill in the art. Thus, for example, radiolabels may be detected using photographic film or scintillation counters, and fluorescent markers may be detected using a photodetector to detect emitted light. Enzymatic labels are typically detected by providing the enzyme with a substrate and detecting the reaction product produced by the action of the enzyme on the substrate, and colorimetric labels are detected by simply visualizing the colored label. One particularly preferred method uses a colloidal gold label that can be detected by measuring scattered light.

The label may be added to the target (sample) nucleic acid(s) prior to, or after, the hybridization. So-called “direct labels” are detectable labels that are directly attached to or incorporated into the target (sample) nucleic acid prior to hybridization. In contrast, so-called “indirect labels” are joined to the hybrid duplex after hybridization. Often, the indirect label is attached to a binding moiety that has been attached to the target nucleic acid prior to the hybridization. Thus, for example, the target nucleic acid may be biotinylated before the hybridization. After hybridization, an avidin-conjugated fluorophore will bind the biotin-bearing hybrid duplexes, providing a label that is easily detected. For a detailed review of methods of labeling nucleic acids and detecting labeled hybridized nucleic acids, see Laboratory Techniques in Biochemistry and Molecular Biology, Vol. 24: Hybridization With Nucleic Acid Probes, P. Tijssen, ed., Elsevier, N.Y. (1993).

Fluorescent labels are preferred and easily added during an in vitro transcription reaction. In a preferred embodiment, fluorescein-labeled UTP and CTP are incorporated into the RNA produced in an in vitro transcription reaction as described above.

Means of detecting labeled target (sample) nucleic acids hybridized to the probes of the high density array are known to those of skill in the art. Thus, for example, where a colorimetric label is used, simple visualization of the label is sufficient. Where a radioactively labeled probe is used, detection of the radiation (e.g., with photographic film or a solid state detector) is sufficient.

In a preferred embodiment, however, the target nucleic acids are labeled with a fluorescent label and the localization of the label on the probe array is accomplished with fluorescence microscopy. The hybridized array is excited with a light source at the excitation wavelength of the particular fluorescent label and the resulting fluorescence at the emission wavelength is detected. In a particularly preferred embodiment, the excitation light source is a laser appropriate for the excitation of the fluorescent label.

The confocal microscope may be automated with a computer-controlled stage to automatically scan the entire high density array. Similarly, the microscope may be equipped with a phototransducer (e.g., a photomultiplier, a solid state array, a CCD camera, etc.) attached to an automated data acquisition system to automatically record the fluorescence signal produced by hybridization to each oligonucleotide probe on the array. Such automated systems are described at length in U.S. Pat. No. 5,143,854, PCT Application WO 92/10092, and copending U.S. application Ser. No. 08/195,889 filed on Feb. 10, 1994. Use of laser illumination in conjunction with automated confocal microscopy for signal detection permits detection at a resolution of better than about 100 μm, more preferably better than about 50 μm, and most preferably better than about 25 μm.

One of skill in the art will appreciate that methods for evaluating the hybridization results vary with the nature of the specific probe nucleic acids used as well as the controls provided. In the simplest embodiment, the fluorescence intensity for each probe is simply quantified. This is accomplished by measuring probe signal strength at each location (representing a different probe) on the high density array (e.g., where the label is a fluorescent label, detection of the amount of fluorescence (intensity) produced by a fixed excitation illumination at each location on the array). Comparison of the absolute intensities of an array hybridized to nucleic acids from a “test” sample with intensities produced by a “control” sample provides a measure of the relative expression of the nucleic acids that hybridize to each of the probes.
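
The test-versus-control comparison can be sketched as a per-probe ratio. Expressing the ratio on a log2 scale is a common convention and is an assumption here, not a requirement of the description; the function name, probe names, and intensities are hypothetical.

```python
import math


def log2_ratios(test, control):
    """Per-probe log2(test / control) for probes present in both samples
    with positive intensities; other probes are skipped."""
    ratios = {}
    for probe, test_value in test.items():
        control_value = control.get(probe)
        if control_value is None or control_value <= 0 or test_value <= 0:
            continue
        ratios[probe] = math.log2(test_value / control_value)
    return ratios


# Hypothetical intensities (arbitrary units) for two exon probes.
test_sample = {"exon1": 800.0, "exon2": 100.0}
control_sample = {"exon1": 400.0, "exon2": 400.0}

relative_expression = log2_ratios(test_sample, control_sample)
print(relative_expression)
```

A positive value indicates higher signal in the test sample, and a negative value indicates higher signal in the control.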

One of skill in the art, however, will appreciate that hybridization signals will vary in strength with efficiency of hybridization, the amount of label on the sample nucleic acid and the amount of the particular nucleic acid in the sample. Typically nucleic acids present at very low levels (e.g., <μM) will show a very weak signal. At some low level of concentration, the signal becomes virtually indistinguishable from the background. In evaluating the hybridization data, a threshold intensity value may be selected below which a signal is treated as essentially indistinguishable from background and is not counted.
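
A thresholding step of this kind can be sketched as follows. The description does not specify how the threshold is chosen; setting it at the mean of background measurements plus a multiple of their standard deviation is an assumption made only for illustration, as are the function names and values.

```python
from statistics import mean, stdev


def background_threshold(background, k=3.0):
    """Threshold = mean + k * sample standard deviation of background
    intensities (an assumed rule, not prescribed by the description)."""
    return mean(background) + k * stdev(background)


def signals_above(intensities, threshold):
    """Keep only probes whose intensity meets or exceeds the threshold."""
    return {probe: value for probe, value in intensities.items()
            if value >= threshold}


# Hypothetical background readings and probe intensities (arbitrary units).
background = [20.0, 22.0, 18.0, 20.0]
intensities = {"probe1": 24.0, "probe2": 150.0}

threshold = background_threshold(background)
counted = signals_above(intensities, threshold)
print(round(threshold, 2), counted)
```

Raising `k` makes the call more conservative at the cost of discarding weak but genuine signals.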

D. Additional Embodiments

In one embodiment of the invention, transcripts (preferably at least 100, 1000, 10000 or all known transcripts) are identified based on sequence databases including public sequence databases. These sequences are then aligned to genomic sequence such as the golden path genomic sequence using, for example, Pslayout. Gene features are then identified based on the alignment. These features include exon, partial exon, intron, exon-exon junction, exon-intron and intron-exon junction sequences. Probes (typically oligonucleotides of at least 20, 25, 30, 40, 50, or 60 bases) for detecting each of these features are then selected. The probes can be spotted or synthesized in situ to form microarrays for detecting alternatively spliced transcripts.
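
Selection of a probe for an exon-exon junction can be sketched as taking the end of the upstream exon and the start of the downstream exon so that the junction falls near the middle of the probe. Centering the junction is an assumption made for illustration, and the function name and exon sequences are hypothetical. The sketch returns the transcript-sense sequence spanning the junction; whether the synthesized probe is this sequence or its reverse complement depends on the target preparation.

```python
def junction_probe(upstream_exon, downstream_exon, length=25):
    """Return a `length`-base sequence spanning the junction between two
    exons, with the junction placed as close to the center as possible."""
    left = length // 2           # bases taken from the end of the upstream exon
    right = length - left        # bases taken from the start of the downstream exon
    if len(upstream_exon) < left or len(downstream_exon) < right:
        raise ValueError("exons are too short for the requested probe length")
    return upstream_exon[-left:] + downstream_exon[:right]


# Hypothetical exon sequences; a variant that skips a middle exon would
# join exon 1 directly to exon 3.
exon1 = "ACGT" * 8
exon3 = "TTGCA" * 6

skip_probe = junction_probe(exon1, exon3, length=25)
print(skip_probe, len(skip_probe))
```

Because such a sequence exists only in transcripts that join the two exons directly, signal from a junction probe distinguishes one splice variant from another.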

The above description is illustrative and not restrictive. Many variations of the invention will become apparent to those of skill in the art upon review of this disclosure. The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.

1. A computerized method for analyzing the expression of splice variants comprising: obtaining probe intensities, wherein the probe intensities reflect the hybridization of nucleic acid probes with various gene structures; modeling probe intensities across multiple experiments using gene structures as constraints; and estimating the levels of the splice variants based upon model parameters.
2. The method of claim 1 wherein the model parameters are obtained through a maximum likelihood estimation process.
3. The method of claim 1 wherein the nucleic acid probes are oligonucleotide probes.
4. The method of claim 3 wherein the oligonucleotide probes are immobilized on a substrate.